Steal the algorithm used to combine hashes from tupleobject.c #15227

Closed
wants to merge 1 commit into from

mikegraham
Contributor

closes #15224

@mikegraham
Contributor Author

Here's an initial pass at stealing https://github.com/python-git/python/blob/master/Objects/tupleobject.c#L290 for the combining. I am not 100% sure that the problem is my (rather crude) combiner; it may instead be the exact way we're using the bitmixer in hash_array. I'm still thinking about it, but I think we might be maintaining undesirable linearity.

May I ask, how did you encounter these collisions?

@mikegraham
Contributor Author

If the basic approach looks sound I can add some comments around some of the lazy iterator wackiness.

arrays = itertools.chain([first], arrays)

mult = np.zeros_like(first) + np.uint64(1000003)
out = np.zeros_like(first) + np.uint64(0x345678L)
Contributor

The L suffix does not work in py3 (remove it and it's ok).
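
For reference, the combining step under discussion can be sketched roughly as follows. This is a minimal sketch, not the code in this PR: the function name `combine_hash_arrays` is hypothetical, the constants (0x345678, 1000003, 82520, 97531) are the ones in the `tuplehash` loop of tupleobject.c as I understand it at the time of this PR, and the input is assumed to be one uint64 hash array per column. Arithmetic wraps modulo 2**64 because everything stays in uint64.

```python
import numpy as np


def combine_hash_arrays(arrays):
    """Combine per-column uint64 hash arrays row-wise, tuplehash-style.

    Hypothetical helper for illustration; not the function from this PR.
    """
    arrays = [np.asarray(a, dtype=np.uint64) for a in arrays]
    if not arrays:
        return np.array([], dtype=np.uint64)

    remaining = len(arrays)
    mult = np.uint64(1000003)
    # No "L" suffix on the literal, so this is valid on Python 3 as well.
    out = np.full(arrays[0].shape, 0x345678, dtype=np.uint64)

    for a in arrays:
        remaining -= 1
        out ^= a          # mix in this column's hashes
        out *= mult       # uint64 array multiply wraps modulo 2**64
        # The multiplier changes per position, so column order matters.
        mult += np.uint64(82520 + 2 * remaining)

    out += np.uint64(97531)
    return out
```

Because the multiplier depends on the position, swapping two columns generally changes the result, which is the property a plain XOR or sum of the column hashes lacks.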

@jreback
Contributor

jreback commented Jan 25, 2017

@mikegraham I found the collisions by hashing

In [3]: i = pd.MultiIndex.from_product([np.arange(1000),np.arange(1000)],names=['one','two'])

In [4]: i.to_dataframe(index=False) 

This is basically a Cartesian product of 1000 x 1000; nothing special really, just a test case I am using.
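
A rough way to count collisions on a test case of this shape, using NumPy only. This is a sketch under assumptions: `cartesian_columns`, `xor_combine`, and `count_collisions` are hypothetical helpers, the raw integer values stand in for per-column hashes, and the XOR combiner is a deliberately weak stand-in, not the combiner from either PR.

```python
import numpy as np


def cartesian_columns(n):
    """Return the two columns of the n x n Cartesian product of range(n)."""
    values = np.arange(n, dtype=np.uint64)
    return np.repeat(values, n), np.tile(values, n)


def xor_combine(a, b):
    """Deliberately weak combiner, used here only to provoke collisions."""
    return a ^ b


def count_collisions(hashes):
    """Number of rows whose hash duplicates an earlier row's hash."""
    hashes = np.asarray(hashes)
    return int(hashes.size - np.unique(hashes).size)


one, two = cartesian_columns(1000)
print(count_collisions(xor_combine(one, two)))
```

Swapping `xor_combine` for another combiner with the same two-array signature gives a quick comparison on the same 1000 x 1000 input.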

@jreback jreback added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Jan 25, 2017
@mikegraham mikegraham force-pushed the emulate_tuple branch 2 times, most recently from 7117b6b to e52c872 Compare January 25, 2017 21:10
@jreback jreback added this to the 0.20.0 milestone Jan 25, 2017
@jreback
Contributor

jreback commented Jan 25, 2017

closing in favor of #15224

thanks @mikegraham

@jreback jreback closed this Jan 25, 2017
jreback added a commit that referenced this pull request Jan 27, 2017
closes #15227

Author: Jeff Reback <[email protected]>
Author: Mike Graham <mikegraham2gmail.com>

Closes #15224 from jreback/mi_hash2 and squashes the following commits:

8b1d3f9 [Jeff Reback] not correctly hashing categorical in a MI
48a2402 [Jeff Reback] support for mixed type arrays
58f682d [Jeff Reback] memory optimization
0c13df7 [Mike Graham] Steal the algorithm used to combine hashes from tupleobject.c
e8dd607 [Jeff Reback] add hash_tuples
44e9c7d [Mike Graham] wipSteal the algorithm used to combine hashes from tupleobject.c
e507c4a [Jeff Reback] ENH: support MultiIndex and tuple hashing
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
closes pandas-dev#15227

Author: Jeff Reback <[email protected]>
Author: Mike Graham <mikegraham2gmail.com>

Closes pandas-dev#15224 from jreback/mi_hash2 and squashes the following commits:

8b1d3f9 [Jeff Reback] not correctly hashing categorical in a MI
48a2402 [Jeff Reback] support for mixed type arrays
58f682d [Jeff Reback] memory optimization
0c13df7 [Mike Graham] Steal the algorithm used to combine hashes from tupleobject.c
e8dd607 [Jeff Reback] add hash_tuples
44e9c7d [Mike Graham] wipSteal the algorithm used to combine hashes from tupleobject.c
e507c4a [Jeff Reback] ENH: support MultiIndex and tuple hashing